Group 6-CV 1:
Contributors - Ashish Tiwari, Kanishk Sanger, Ninad Mahajan, Tushar Bisht, Avinash Kumra Sharma
Mentor - Mr. Rohit Raj
Pneumonia is an infection in one or both lungs. Bacteria, viruses, and fungi cause it. The infection causes inflammation in the air sacs in your lungs, which are called alveoli.
CXRs are the most commonly performed diagnostic imaging study. A number of factors such as positioning of the patient and depth of inspiration can alter the appearance of the CXR, complicating interpretation further. In addition, clinicians are faced with reading high volumes of images every shift.
The goal is to automate pneumonia screening in chest radiographs and to report the affected areas through bounding boxes.
The objective of this project is to build an algorithm that locates the position of inflammation in a medical image, i.e., that automatically detects lung opacities on chest radiographs.
Acknowledgment for the datasets: https://www.kaggle.com/c/rsna-pneumonia-detection-challenge/overview/acknowledgements
# gpu_info = !nvidia-smi
# gpu_info = '\n'.join(gpu_info)
# if gpu_info.find('failed') >= 0:
#     print('Select the Runtime > "Change runtime type" menu to enable a GPU accelerator,')
#     print('and then re-execute this cell.')
# else:
#     print(gpu_info)
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Run the installation commands below if the packages are not already installed
!pip install tensorflow
!pip install tensorflow-gpu
!pip install tqdm
!pip install pydicom
!pip install -U albumentations
(pip output truncated) Requirement already satisfied: tensorflow (2.9.1), tensorflow-gpu (2.9.1), tqdm (4.64.0), pydicom (2.3.0), albumentations (1.2.0) and their dependencies.
# %tensorflow_version 2.x
%matplotlib inline
import tensorflow as tf
# Initialize the random number generator
import random
random.seed(0)
# Ignore the warnings
import warnings
warnings.filterwarnings("ignore")
rootDir='/content/drive/MyDrive/AIML/Projects/'
#zipFilename=rootDir+'rsna-pneumonia-detection-challenge.zip'
datasetPath=rootDir+'CV - Pneumonia Detection/'
from glob import glob
import os
import time
import math
import fnmatch
from zipfile import ZipFile
from tqdm.notebook import tqdm as tqdm_notebook
import numpy as np
import pandas as pd
import cv2
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from matplotlib.patches import Rectangle
import pydicom as dicom
import seaborn as sns
from sklearn.utils import shuffle
from skimage.measure import label, regionprops
from sklearn.metrics import precision_recall_fscore_support as prf
import albumentations as A
from tensorflow import keras
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.python.keras.utils.data_utils import Sequence
from tensorflow.keras.layers import Conv2D, Input, Flatten, Dense, Dropout, Concatenate, BatchNormalization, Conv2DTranspose
from tensorflow.keras.models import Model, Sequential, load_model
import tensorflow.keras.backend as K
from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping, ReduceLROnPlateau
from tensorflow.keras.losses import binary_crossentropy
stage_2_train_labels.csv - CSV file containing the patientId, bounding boxes and target label
stage_2_detailed_class_info.csv - CSV file containing detailed information on each patientId and the corresponding class label
stage_2_train_images - directory containing the train images in DICOM format
stage_2_test_images - directory containing the test images in DICOM format
trainImagesDir=datasetPath+'stage_2_train_images/'
testImagesDir=datasetPath+'stage_2_test_images/'
sampleSubmission=datasetPath+'stage_2_sample_submission.csv'
classInfo=datasetPath+'stage_2_detailed_class_info.csv'
rsnaLink=datasetPath+'GCP Credits Request Link - RSNA.txt'
trainLabels=datasetPath+'stage_2_train_labels.csv'
os.path.exists(trainImagesDir)
True
image_train_path = glob(trainImagesDir+'*.dcm')
image_test_path = glob(testImagesDir+'*.dcm')
print("Number of images in train set:", len(image_train_path),"\nNumber of images in test set:", len(image_test_path))
Number of images in train set: 26684 Number of images in test set: 3000
# Loading the data
# There are two input files given - Detailed class info and train labels
class_info_df = pd.read_csv(classInfo)
train_labels_df = pd.read_csv(trainLabels)
print("Detailed class info - rows: {}, columns: {}".format(class_info_df.shape[0], class_info_df.shape[1]))
print("Train labels - rows: {}, columns: {}".format(train_labels_df.shape[0], train_labels_df.shape[1]))
Detailed class info - rows: 30227, columns: 2 Train labels - rows: 30227, columns: 6
class_info_df.head(10)
| | patientId | class |
|---|---|---|
| 0 | 0004cfab-14fd-4e49-80ba-63a80b6bddd6 | No Lung Opacity / Not Normal |
| 1 | 00313ee0-9eaa-42f4-b0ab-c148ed3241cd | No Lung Opacity / Not Normal |
| 2 | 00322d4d-1c29-4943-afc9-b6754be640eb | No Lung Opacity / Not Normal |
| 3 | 003d8fa0-6bf1-40ed-b54c-ac657f8495c5 | Normal |
| 4 | 00436515-870c-4b36-a041-de91049b9ab4 | Lung Opacity |
| 5 | 00436515-870c-4b36-a041-de91049b9ab4 | Lung Opacity |
| 6 | 00569f44-917d-4c86-a842-81832af98c30 | No Lung Opacity / Not Normal |
| 7 | 006cec2e-6ce2-4549-bffa-eadfcd1e9970 | No Lung Opacity / Not Normal |
| 8 | 00704310-78a8-4b38-8475-49f4573b2dbb | Lung Opacity |
| 9 | 00704310-78a8-4b38-8475-49f4573b2dbb | Lung Opacity |
train_labels_df.head(10)
| | patientId | x | y | width | height | Target |
|---|---|---|---|---|---|---|
| 0 | 0004cfab-14fd-4e49-80ba-63a80b6bddd6 | NaN | NaN | NaN | NaN | 0 |
| 1 | 00313ee0-9eaa-42f4-b0ab-c148ed3241cd | NaN | NaN | NaN | NaN | 0 |
| 2 | 00322d4d-1c29-4943-afc9-b6754be640eb | NaN | NaN | NaN | NaN | 0 |
| 3 | 003d8fa0-6bf1-40ed-b54c-ac657f8495c5 | NaN | NaN | NaN | NaN | 0 |
| 4 | 00436515-870c-4b36-a041-de91049b9ab4 | 264.0 | 152.0 | 213.0 | 379.0 | 1 |
| 5 | 00436515-870c-4b36-a041-de91049b9ab4 | 562.0 | 152.0 | 256.0 | 453.0 | 1 |
| 6 | 00569f44-917d-4c86-a842-81832af98c30 | NaN | NaN | NaN | NaN | 0 |
| 7 | 006cec2e-6ce2-4549-bffa-eadfcd1e9970 | NaN | NaN | NaN | NaN | 0 |
| 8 | 00704310-78a8-4b38-8475-49f4573b2dbb | 323.0 | 577.0 | 160.0 | 104.0 | 1 |
| 9 | 00704310-78a8-4b38-8475-49f4573b2dbb | 695.0 | 575.0 | 162.0 | 137.0 | 1 |
The detailed class info dataset gives the type of class associated with each patientId. It has three classes: "Lung Opacity", "Normal" and "No Lung Opacity / Not Normal".
The train labels CSV contains the patientId, the bounding box details (the (x, y) coordinates together with the width and height that define the box) and the Target variable. Where the Target is 0, the bounding box columns hold NaN values.
Looking closely, there are duplicate patientId entries in the CSV files. Rows #4 and #5, and rows #8 and #9, share the same patientId; that is, the patient has pneumonia detected in multiple areas of the lungs.
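Counting rows per patientId makes this multiple-box structure explicit. A minimal sketch on a hypothetical mini frame (not the real dataset):

```python
import pandas as pd

# Hypothetical mini version of train_labels_df: one row per bounding box,
# and a single row (with no box) for each Target == 0 patient
labels = pd.DataFrame({
    "patientId": ["p1", "p2", "p2", "p3", "p3", "p3"],
    "Target":    [0,    1,    1,    1,    1,    1],
})

# Rows per patientId = number of annotated boxes for positive patients
boxes_per_patient = labels["patientId"].value_counts()
print(boxes_per_patient.max())  # 3 -> most boxes for a single patient
```

The same `value_counts()` pattern is what the notebook later uses to find the maximum number of opacity areas per patient.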
Check the unique patient ID in the train dataset
print("Unique patientId in train_class_df: ", train_labels_df['patientId'].nunique())
Unique patientId in train_class_df: 26684
train_labels_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 30227 entries, 0 to 30226 Data columns (total 6 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 patientId 30227 non-null object 1 x 9555 non-null float64 2 y 9555 non-null float64 3 width 9555 non-null float64 4 height 9555 non-null float64 5 Target 30227 non-null int64 dtypes: float64(4), int64(1), object(1) memory usage: 1.4+ MB
print(train_labels_df[train_labels_df.Target==0].shape[0])
print(train_labels_df[train_labels_df.Target==1].shape[0])
20672 9555
We see from above that 9,555 rows are identified with pneumonia (Target = 1), which matches the number of non-null bounding box values. It can be inferred that every pneumonia row has a bounding box defined, while normal patients have none.
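This inference can be checked directly with pandas; a toy sketch using the same schema assumption (hypothetical values, not the real dataset):

```python
import numpy as np
import pandas as pd

# Toy frame mirroring train_labels_df: boxes exist only where Target == 1
df = pd.DataFrame({
    "patientId": ["a", "b", "b", "c"],
    "x":         [np.nan, 100.0, 300.0, np.nan],
    "Target":    [0, 1, 1, 0],
})

# Every positive row has box coordinates; every negative row has none
assert df.loc[df.Target == 1, "x"].notna().all()
assert df.loc[df.Target == 0, "x"].isna().all()
```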
def missing_data(data):
    total = data.isnull().sum().sort_values(ascending=False)
    percent = (data.isnull().sum()/data.isnull().count()*100).sort_values(ascending=False)
    return np.transpose(pd.concat([total, percent], axis=1, keys=['Total', 'Percent']))
missing_data(train_labels_df)
| | x | y | width | height | patientId | Target |
|---|---|---|---|---|---|---|
| Total | 20672.000000 | 20672.000000 | 20672.000000 | 20672.000000 | 0.0 | 0.0 |
| Percent | 68.389188 | 68.389188 | 68.389188 | 68.389188 | 0.0 | 0.0 |
missing_data(class_info_df)
| | patientId | class |
|---|---|---|
| Total | 0.0 | 0.0 |
| Percent | 0.0 | 0.0 |
plt.rc('axes', labelsize=15)
plt.rc('axes', titlesize=20)
sns.set_palette('Set2')
f, ax = plt.subplots(1,1, figsize=(6,4))
total = float(class_info_df.shape[0])
sns.countplot(x='class', data=class_info_df, order=class_info_df['class'].value_counts().index, ax=ax)
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x() + p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(100*height/total),
            ha="center")
plt.show()
# More details on classes - No Lung Opacity / Not Normal, Lung Opacity, Normal
def get_feature_distribution(data, feature):
    # Get the count for each label
    label_counts = data[feature].value_counts()
    # Get the total number of samples
    total_samples = data.shape[0]
    # Print the count and percentage for each class
    print("{:<30s}: count(percentage)".format(feature))
    for i in range(len(label_counts)):
        label = label_counts.index[i]
        count = label_counts.values[i]
        percent = round((count / total_samples) * 100, 2)
        print("{:<30s}: {}({}%)".format(label, count, percent))
get_feature_distribution(class_info_df, 'class')
class : count(percentage) No Lung Opacity / Not Normal : 11821(39.11%) Lung Opacity : 9555(31.61%) Normal : 8851(29.28%)
"No Lung Opacity / Not Normal" and "Normal" together account for 68.39% of the rows, the same percentage as the missing bounding box values in the train labels.
The share of rows with pneumonia in the train set is therefore 31.61%.
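Note that 31.61% is a share of rows; since a pneumonia patient contributes one row per bounding box, the patient-level rate is somewhat lower. A toy sketch of the distinction (hypothetical data, not the real dataset):

```python
import pandas as pd

# Toy frame: patient "b" has two boxes, hence two positive rows
df = pd.DataFrame({
    "patientId": ["a", "b", "b", "c"],
    "Target":    [0,   1,   1,   0],
})

row_rate = (df["Target"] == 1).mean()                          # share of rows
patient_rate = df.groupby("patientId")["Target"].max().mean()  # share of patients
print(row_rate, patient_rate)  # 0.5 vs 0.333...
```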
train_labels_df.Target.unique()
array([0, 1])
The target has two classes: 0 (no lung opacity) and 1 (lung opacity / pneumonia).
train_labels_df.shape[0], class_info_df.shape[0]
(30227, 30227)
# merging the two datasets (train labels and class detail info) on patientId
train_class_df = train_labels_df.merge(class_info_df, on='patientId', how='inner')
train_class_df.sample(5)
| | patientId | x | y | width | height | Target | class |
|---|---|---|---|---|---|---|---|
| 29870 | d4e000af-5843-4f76-90b3-0b387d860142 | NaN | NaN | NaN | NaN | 0 | Normal |
| 34950 | f9948c5e-f97b-45a2-a5f5-236cf83fdc13 | 290.0 | 227.0 | 166.0 | 330.0 | 1 | Lung Opacity |
| 11912 | 66d020e7-2f2a-41b6-a50b-abbf958588aa | NaN | NaN | NaN | NaN | 0 | No Lung Opacity / Not Normal |
| 7635 | 47a778f5-cd9e-4c76-a689-74ac30071b8d | NaN | NaN | NaN | NaN | 0 | No Lung Opacity / Not Normal |
| 32825 | eb632397-ff01-45ba-9bb7-23d1d376821b | NaN | NaN | NaN | NaN | 0 | No Lung Opacity / Not Normal |
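Because patientId is duplicated in both frames (one row per box), an inner merge multiplies the duplicate rows per key, which is why the merged frame has more rows than either input. A minimal illustration with hypothetical keys:

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "b"], "v1": [1, 2, 3]})
right = pd.DataFrame({"key": ["a", "b", "b"], "v2": [10, 20, 30]})

merged = left.merge(right, on="key", how="inner")
# "a": 1x1 = 1 row, "b": 2x2 = 4 rows -> 5 rows total
print(len(merged))  # 5
```

Passing `validate=` to `merge` (e.g. `validate="one_to_one"`) raises an error when such unexpected duplication occurs, which can be a useful safety check.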
#plotting the number of examinations for each class detected, grouped by Target value
fig, ax = plt.subplots(nrows=1,figsize=(12,6))
tmp = train_class_df.groupby('Target')['class'].value_counts()
df = pd.DataFrame(data={'Freq': tmp.values}, index=tmp.index).reset_index()
sns.barplot(ax=ax, x='Target', y='Freq', hue='class', data=df)
plt.title("Chest examination - Frequency of Targets")
plt.show()
samplePatientID = train_class_df['patientId'].iloc[0]
samplePatientID = samplePatientID + '.dcm'
dicom_file_path = os.path.join(trainImagesDir, samplePatientID)
dicom_file_dataset = dicom.dcmread(dicom_file_path)
dicom_file_dataset
Dataset.file_meta ------------------------------- (0002, 0000) File Meta Information Group Length UL: 202 (0002, 0001) File Meta Information Version OB: b'\x00\x01' (0002, 0002) Media Storage SOP Class UID UI: Secondary Capture Image Storage (0002, 0003) Media Storage SOP Instance UID UI: 1.2.276.0.7230010.3.1.4.8323329.28530.1517874485.775526 (0002, 0010) Transfer Syntax UID UI: JPEG Baseline (Process 1) (0002, 0012) Implementation Class UID UI: 1.2.276.0.7230010.3.0.3.6.0 (0002, 0013) Implementation Version Name SH: 'OFFIS_DCMTK_360' ------------------------------------------------- (0008, 0005) Specific Character Set CS: 'ISO_IR 100' (0008, 0016) SOP Class UID UI: Secondary Capture Image Storage (0008, 0018) SOP Instance UID UI: 1.2.276.0.7230010.3.1.4.8323329.28530.1517874485.775526 (0008, 0020) Study Date DA: '19010101' (0008, 0030) Study Time TM: '000000.00' (0008, 0050) Accession Number SH: '' (0008, 0060) Modality CS: 'CR' (0008, 0064) Conversion Type CS: 'WSD' (0008, 0090) Referring Physician's Name PN: '' (0008, 103e) Series Description LO: 'view: PA' (0010, 0010) Patient's Name PN: '0004cfab-14fd-4e49-80ba-63a80b6bddd6' (0010, 0020) Patient ID LO: '0004cfab-14fd-4e49-80ba-63a80b6bddd6' (0010, 0030) Patient's Birth Date DA: '' (0010, 0040) Patient's Sex CS: 'F' (0010, 1010) Patient's Age AS: '51' (0018, 0015) Body Part Examined CS: 'CHEST' (0018, 5101) View Position CS: 'PA' (0020, 000d) Study Instance UID UI: 1.2.276.0.7230010.3.1.2.8323329.28530.1517874485.775525 (0020, 000e) Series Instance UID UI: 1.2.276.0.7230010.3.1.3.8323329.28530.1517874485.775524 (0020, 0010) Study ID SH: '' (0020, 0011) Series Number IS: '1' (0020, 0013) Instance Number IS: '1' (0020, 0020) Patient Orientation CS: '' (0028, 0002) Samples per Pixel US: 1 (0028, 0004) Photometric Interpretation CS: 'MONOCHROME2' (0028, 0010) Rows US: 1024 (0028, 0011) Columns US: 1024 (0028, 0030) Pixel Spacing DS: [0.14300000000000002, 0.14300000000000002] (0028, 0100) Bits Allocated US: 8 
(0028, 0101) Bits Stored US: 8 (0028, 0102) High Bit US: 7 (0028, 0103) Pixel Representation US: 0 (0028, 2110) Lossy Image Compression CS: '01' (0028, 2114) Lossy Image Compression Method CS: 'ISO_10918_1' (7fe0, 0010) Pixel Data OB: Array of 142006 elements
It is observed that some useful information with predictive value is available in the DICOM metadata, for example:
patient sex, patient age, modality, body part examined, view position, rows and columns, and pixel spacing.
def show_dicom_images(data, bbox=False):
    img_data = list(data.T.to_dict().values())
    f, ax = plt.subplots(2, 2, figsize=(16, 18))
    for i, data_row in enumerate(img_data):
        patientImage = data_row['patientId'] + '.dcm'
        imagePath = os.path.join(trainImagesDir, patientImage)
        # read the DICOM file once and reuse it for both metadata and pixel data
        data_row_img = dicom.dcmread(imagePath)
        modality = data_row_img.Modality
        age = data_row_img.PatientAge
        sex = data_row_img.PatientSex
        ax[i//2, i%2].imshow(data_row_img.pixel_array, cmap=plt.cm.bone)
        ax[i//2, i%2].axis('off')
        ax[i//2, i%2].set_title('ID: {}\nModality: {} Age: {} Sex: {} Target: {}\nClass: {}\nBounding box: {}:{}:{}:{}'.format(
            data_row['patientId'],
            modality, age, sex, data_row['Target'], data_row['class'],
            data_row['x'], data_row['y'], data_row['width'], data_row['height']))
        if bbox:
            rows = train_class_df[train_class_df['patientId'] == data_row['patientId']]
            box_data = list(rows.T.to_dict().values())
            for j, row in enumerate(box_data):
                ax[i//2, i%2].add_patch(Rectangle(xy=(row['x'], row['y']),
                                                  width=row['width'], height=row['height'],
                                                  color="yellow", alpha=0.1))
    plt.tight_layout()
    plt.show()
show_dicom_images(train_class_df[train_class_df['Target']==1].sample(4))
show_dicom_images(train_class_df[train_class_df['Target']==1].sample(4), bbox=True)
For some of the images with Target=1, multiple areas of lung opacity (boxes/rectangles) can be seen.
print('A maximum of {} areas are detected in the lungs for a pneumonia patient'.format(max(train_labels_df.patientId.value_counts())))
A maximum of 4 areas are detected in the lungs for a pneumonia patient
show_dicom_images(train_class_df[train_class_df['Target']==0].sample(4))
# parsing the DICOM meta information and adding it to the train dataset
vars = ['Modality', 'PatientAge', 'PatientSex', 'BodyPartExamined', 'ViewPosition', 'ConversionType', 'Rows', 'Columns', 'PixelSpacing']
def process_dicom_data(data_df, data_path):
    for var in vars:
        data_df[var] = None
    image_names = os.listdir(data_path)
    for i, img_name in tqdm_notebook(enumerate(image_names)):
        imagePath = os.path.join(data_path, img_name)
        data_row_img_data = dicom.dcmread(imagePath)
        idx = (data_df['patientId'] == data_row_img_data.PatientID)
        data_df.loc[idx, 'Modality'] = data_row_img_data.Modality
        data_df.loc[idx, 'PatientAge'] = pd.to_numeric(data_row_img_data.PatientAge)
        data_df.loc[idx, 'PatientSex'] = data_row_img_data.PatientSex
        data_df.loc[idx, 'BodyPartExamined'] = data_row_img_data.BodyPartExamined
        data_df.loc[idx, 'ViewPosition'] = data_row_img_data.ViewPosition
        data_df.loc[idx, 'ConversionType'] = data_row_img_data.ConversionType
        data_df.loc[idx, 'Rows'] = data_row_img_data.Rows
        data_df.loc[idx, 'Columns'] = data_row_img_data.Columns
        data_df.loc[idx, 'PixelSpacing'] = str.format("{:4.3f}", data_row_img_data.PixelSpacing[0])
    return data_df
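Note that `process_dicom_data` performs a full boolean-mask scan of the DataFrame for every file. As a minimal sketch of a faster alternative (not the notebook's code; the toy records below stand in for what `pydicom.dcmread` would return), metadata can be collected into a lookup keyed by `PatientID` and merged in a single pass:

```python
# Sketch, assuming one metadata record per image file keyed by PatientID.
def build_metadata_index(records):
    """Map PatientID -> metadata dict for a single-pass merge."""
    return {rec['PatientID']: rec for rec in records}

def attach_metadata(rows, index):
    """Attach the metadata record matching each row's patientId."""
    return [{**row, **index.get(row['patientId'], {})} for row in rows]

records = [
    {'PatientID': 'p1', 'Modality': 'CR', 'PatientAge': 51},
    {'PatientID': 'p2', 'Modality': 'CR', 'PatientAge': 34},
]
rows = [{'patientId': 'p1', 'Target': 1}, {'patientId': 'p2', 'Target': 0}]
merged = attach_metadata(rows, build_metadata_index(records))
```

The same idea applies with pandas: build a metadata DataFrame and `merge` once on the patient id, instead of updating `data_df.loc[idx, ...]` per image.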
gpuname = tf.test.gpu_device_name()
print('gpu name:', gpuname)
if gpuname:
    with tf.device(gpuname):
        train_class_df = process_dicom_data(train_class_df, trainImagesDir)
else:
    train_class_df = process_dicom_data(train_class_df, trainImagesDir)
gpu name: /device:GPU:0
# Creating the test dataset with similar information to that of the train set
# stage_2_sample_submission.csv contains patientIds for the test set. The sample submission contains one box per image, but there is no limit to the number of bounding boxes that can be assigned to a given image.
test_class_df = pd.read_csv(sampleSubmission)
test_class_df.head()
| | patientId | PredictionString |
|---|---|---|
| 0 | 0000a175-0e68-4ca4-b1af-167204a7e0bc | 0.5 0 0 100 100 |
| 1 | 0005d3cc-3c3f-40b9-93c3-46231c3eb813 | 0.5 0 0 100 100 |
| 2 | 000686d7-f4fc-448d-97a0-44fa9c5d3aa6 | 0.5 0 0 100 100 |
| 3 | 000e3a7d-c0ca-4349-bb26-5af2d8993c3d | 0.5 0 0 100 100 |
| 4 | 00100a24-854d-423d-a092-edcf6179e061 | 0.5 0 0 100 100 |
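Each `PredictionString` encodes one or more predicted boxes as repeating `confidence x y width height` groups. A small sketch of parsing it (the helper name is ours, not part of the competition kit):

```python
def parse_prediction_string(s):
    """Split 'conf x y w h conf x y w h ...' into a list of 5-tuples of floats."""
    vals = [float(v) for v in s.split()]
    return [tuple(vals[i:i + 5]) for i in range(0, len(vals), 5)]

boxes = parse_prediction_string('0.5 0 0 100 100')
```

A string with two groups parses into two tuples, which matches the "no limit on boxes per image" note above.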
test_class_df = test_class_df.drop('PredictionString', axis=1)
if gpuname:
    with tf.device(gpuname):
        test_class_df = process_dicom_data(test_class_df, testImagesDir)
else:
    test_class_df = process_dicom_data(test_class_df, testImagesDir)
# Checking how many modalities are used
print("Modalities: train:",train_class_df['Modality'].unique(), "test:", test_class_df['Modality'].unique())
Modalities: train: ['CR'] test: ['CR']
# checking if body parts other than 'CHEST' appear in the data
print("Body Part Examined: train:",train_class_df['BodyPartExamined'].unique(), "test:", test_class_df['BodyPartExamined'].unique())
Body Part Examined: train: ['CHEST'] test: ['CHEST']
# View Position is a radiographic view associated with the Patient Position. Let's check the View Position distribution for both datasets
print("View Position: train:",train_class_df['ViewPosition'].unique(), "test:", test_class_df['ViewPosition'].unique())
View Position: train: ['PA' 'AP'] test: ['PA' 'AP']
#Train dataset-checking the distribution of PA and AP
get_feature_distribution(train_class_df,'ViewPosition')
ViewPosition : count(percentage) AP : 21817(57.98%) PA : 15812(42.02%)
get_feature_distribution(test_class_df,'ViewPosition')
ViewPosition : count(percentage) PA : 1618(53.93%) AP : 1382(46.07%)
# Conversion Type: Let's check the Conversion Type data
print("Conversion Type: train:",train_class_df['ConversionType'].unique(), "test:", test_class_df['ConversionType'].unique())
Conversion Type: train: ['WSD'] test: ['WSD']
# Rows and columns
print("Rows: train:",train_class_df['Rows'].unique(), "test:", test_class_df['Rows'].unique())
print("Columns: train:",train_class_df['Columns'].unique(), "test:", test_class_df['Columns'].unique())
Rows: train: [1024] test: [1024]
Columns: train: [1024] test: [1024]
Only one image size, 1024 rows × 1024 columns, is present in both train and test.
Even though all images are the same size, the lungs occupy a different portion of each image (not all images fill the frame the same way), and the brightness also varies between images.
#Test dataset
fig, (ax) = plt.subplots(nrows=1,figsize=(16,6))
sns.countplot(test_class_df['PatientAge'], ax=ax)
plt.title("Test set: Patient Age")
plt.xticks(rotation=90)
plt.show()
# Test Data
sns.countplot(test_class_df['PatientSex'])
plt.title("Test set: Patient Sex")
plt.show()
def update_dataset(path, df1):
    pid = []
    label = []
    bbox = []
    for name, group in df1.groupby(['patientId', 'Target']):
        pid.append(path + group['patientId'].tolist()[0] + '.dcm')
        label.append(group['Target'].tolist()[0])
        if group['Target'].tolist()[0] == 1:
            ibbox = []
            for row in group.iterrows():
                ibbox.append([row[1]['x'], row[1]['y'], row[1]['width'], row[1]['height']])
            bbox.append(ibbox)
        else:
            bbox.append([])
    df = pd.DataFrame({'patientId': pid, 'bboxes': bbox, 'label': label})
    return df
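The grouping above collapses the one-row-per-box labels file into one entry per patient. As an illustration with plain dicts instead of a DataFrame (toy rows, not the real labels file):

```python
# Sketch of the grouping update_dataset performs, on hand-made rows.
def group_boxes(rows):
    """Collapse one-row-per-box labels into one entry per patient."""
    grouped = {}
    for r in rows:
        entry = grouped.setdefault(r['patientId'], {'label': r['Target'], 'bboxes': []})
        if r['Target'] == 1:
            entry['bboxes'].append([r['x'], r['y'], r['width'], r['height']])
    return grouped

rows = [
    {'patientId': 'a', 'Target': 1, 'x': 316, 'y': 318, 'width': 170, 'height': 478},
    {'patientId': 'a', 'Target': 1, 'x': 660, 'y': 375, 'width': 160, 'height': 400},
    {'patientId': 'b', 'Target': 0, 'x': None, 'y': None, 'width': None, 'height': None},
]
grouped = group_boxes(rows)
```

Patient `a` ends up with two boxes and patient `b` with an empty list, mirroring the `bboxes` column shown below.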
df=update_dataset(trainImagesDir, train_labels_df)
print(df.shape)
df.head()
(26684, 3)
| | patientId | bboxes | label |
|---|---|---|---|
| 0 | /content/drive/MyDrive/AIML/Projects/CV - Pneu... | [] | 0 |
| 1 | /content/drive/MyDrive/AIML/Projects/CV - Pneu... | [] | 0 |
| 2 | /content/drive/MyDrive/AIML/Projects/CV - Pneu... | [[316.0, 318.0, 170.0, 478.0], [660.0, 375.0, ... | 1 |
| 3 | /content/drive/MyDrive/AIML/Projects/CV - Pneu... | [[570.0, 282.0, 269.0, 409.0], [83.0, 227.0, 2... | 1 |
| 4 | /content/drive/MyDrive/AIML/Projects/CV - Pneu... | [[66.0, 160.0, 373.0, 608.0], [552.0, 164.0, 3... | 1 |
print('Total number of patients that are normal are {}'.format(df[df.label==0].shape[0]))
print('Total number of patients that have pneumonia are {}'.format(df[df.label==1].shape[0]))
Total number of patients that are normal are 20672
Total number of patients that have pneumonia are 6012
imgWidth=224
imgHeight=224
imgChannels=3
imgSize=(imgHeight, imgWidth)
batchSize=64
labelDict={0:'normal', 1:'lung opacity'}
def check_images_shape(path):
    img_path = glob(path + '*.dcm')
    imagesShape = [dicom.dcmread(image).pixel_array.shape for image in tqdm_notebook(img_path)]
    print(set(imagesShape))
check_images_shape(trainImagesDir)
{(1024, 1024)}
c=math.ceil(df.shape[0]*0.7)
train_df,val_df=df[:c],df[c:]
print(train_df.shape, val_df.shape)
(18679, 3) (8005, 3)
# image transformer object
transform = A.Compose([
    A.RandomRotate90(),
    A.Flip(),
    A.Transpose(),
    A.OneOf([
        A.IAAAdditiveGaussianNoise(),
        A.GaussNoise(),
    ], p=0.2),
    A.OneOf([
        A.MotionBlur(p=.2),
        A.MedianBlur(blur_limit=3, p=0.1),
        A.Blur(blur_limit=3, p=0.1),
    ], p=0.2),
    A.ShiftScaleRotate(shift_limit=0.0625, scale_limit=0.2, rotate_limit=45, p=0.2),
    A.OneOf([
        A.OpticalDistortion(p=0.3),
        A.GridDistortion(p=.1),
        A.IAAPiecewiseAffine(p=0.3),
    ], p=0.2),
    A.OneOf([
        A.CLAHE(clip_limit=2),
        A.IAASharpen(),
        A.IAAEmboss(),
        A.RandomBrightnessContrast(),
    ], p=0.3),
    A.HueSaturationValue(p=0.3),
])
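Conceptually, each transform in the pipeline is just a function mapping an image to an image, applied with some probability `p`. As a library-free illustration of that idea (NumPy only, not albumentations itself):

```python
import numpy as np

def random_hflip(image, p=0.5, rng=None):
    """Flip the image left-right with probability p, as an augmentation would."""
    rng = rng if rng is not None else np.random.default_rng(0)
    if rng.random() < p:
        return image[:, ::-1]
    return image

img = np.arange(6).reshape(2, 3)      # [[0, 1, 2], [3, 4, 5]]
flipped = random_hflip(img, p=1.0)    # p=1 forces the flip
```

albumentations composes many such functions and additionally remaps any bounding boxes or masks in step with the image when configured to do so.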
# Custom data generator with mask approach
class CustomDataGen(Sequence):
    def __init__(self, df, x_col, y_col, batch_size, input_size=(224, 224, 3),
                 preprocess_function=None, shuffle=True, predict=False, include_labels=False):
        self.df = df.copy(deep=True)
        self.x_col = x_col
        self.y_col = y_col
        self.batch_size = batch_size
        self.input_size = input_size
        self.shuffle = shuffle
        self.predict = predict
        self.n = df.shape[0]
        self.n_class = df[y_col['output1']].nunique()
        self.preprocess_function = preprocess_function
        self.include_labels = include_labels

    def on_epoch_end(self):
        if self.shuffle:
            # shuffle whole rows, not a single column, so images stay aligned with labels/boxes
            self.df = self.df.sample(frac=1).reset_index(drop=True)

    def __loadinput(self, img_path, lbl, bboxes):
        image = dicom.dcmread(img_path).pixel_array
        img_size = (image.shape[0], image.shape[1])
        image = cv2.resize(image, (self.input_size[0], self.input_size[1]))
        if len(image.shape) != 3 or image.shape[2] != 3:
            image = np.stack([image] * 3, axis=-1)
        # apply data augmentation for the train/validation generators;
        # include_labels=True is only used for evaluation, so augmentation is skipped there
        if not self.include_labels:
            image = transform(image=image)['image']
        if self.preprocess_function is not None:
            image = self.preprocess_function(image)
        if self.predict:
            return image
        # rasterize the bounding boxes into a binary mask at the resized resolution
        mask = np.zeros((self.input_size[0], self.input_size[1]))
        for i, bbox in enumerate(bboxes):
            x, y, w, h = bbox
            x1 = math.floor(x * self.input_size[1] / img_size[1])
            y1 = math.floor(y * self.input_size[0] / img_size[0])
            x2 = math.ceil((x + w) * self.input_size[1] / img_size[1])
            y2 = math.ceil((y + h) * self.input_size[0] / img_size[0])
            mask[y1:y2, x1:x2] = 1
        if self.include_labels:
            return image, img_path, lbl, mask
        return image, mask

    def __loaddata(self, batches):
        # generates data containing batch_size samples
        path_batch = batches[self.x_col['input']]
        classes_batch = batches[self.y_col['output1']]
        bboxes_batch = batches[self.y_col['output2']]
        img_batch = [self.__loadinput(filename, lbl, bboxes)
                     for filename, lbl, bboxes in zip(path_batch, classes_batch, bboxes_batch)]
        return img_batch

    def __getitem__(self, index):
        batches = self.df[index * self.batch_size:(index + 1) * self.batch_size]
        batchdata = self.__loaddata(batches)
        if self.predict:
            return batchdata
        if self.include_labels:
            image, img_path, lbl, mask = zip(*batchdata)
            return np.array(image), np.array(img_path), np.array(lbl), np.array(mask)
        image, mask = zip(*batchdata)
        return np.array(image), np.array(mask)

    def __len__(self):
        return math.ceil(self.n / self.batch_size)
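`__loadinput` scales each box from the 1024-pixel original into 224×224 mask coordinates, flooring the top-left corner and ceiling the bottom-right so the mask never undershoots the box. The arithmetic, as a standalone sketch:

```python
import math

def scale_bbox_to_mask(bbox, src=1024, dst=224):
    """Scale (x, y, w, h) from src-pixel coords to dst-pixel mask corner coords.
    Floor on the top-left and ceil on the bottom-right, as in __loadinput."""
    x, y, w, h = bbox
    x1 = math.floor(x * dst / src)
    y1 = math.floor(y * dst / src)
    x2 = math.ceil((x + w) * dst / src)
    y2 = math.ceil((y + h) * dst / src)
    return x1, y1, x2, y2

# one of the boxes shown earlier: x=316, y=318, w=170, h=478
corners = scale_bbox_to_mask((316.0, 318.0, 170.0, 478.0))
```

Here 316 × 224/1024 = 69.125 floors to 69, and (316+170) × 224/1024 = 106.3125 ceils to 107, so the box covers mask columns 69..106 inclusive.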
We need to identify the distribution (pneumonia vs. normal) of the data used. Consider a portion of the dataset for training (2,000 samples) and use 25% of that size for validation.
trainsize=2000
valsize=int(0.25*trainsize)
trainpartial=train_df[:trainsize]
valpartial=val_df[:valsize]
print('Distribution of labels in the train data are ',
(trainpartial.label.value_counts().values/trainpartial.label.value_counts().values.sum())*100)
print('Distribution of labels in the validation data are ',
(valpartial.label.value_counts().values/valpartial.label.value_counts().values.sum())*100)
Distribution of labels in the train data are [67.85 32.15]
Distribution of labels in the validation data are [56.6 43.4]
The label distribution in both the train and validation subsets is reasonable, and the two subsets are broadly similar to each other, so it is acceptable to proceed.
xcol = {'input': 'patientId'}
ycol = {'output1': 'label', 'output2': 'bboxes'}
traingen = CustomDataGen(trainpartial,
                         x_col=xcol,
                         y_col=ycol,
                         batch_size=batchSize, input_size=imgSize,
                         preprocess_function=preprocess_input)
valgen = CustomDataGen(valpartial,
                       x_col=xcol,
                       y_col=ycol,
                       batch_size=batchSize, input_size=imgSize,
                       preprocess_function=preprocess_input)
Freeze the initial VGG16 layers and train only the top layers (freeze_layers=16 leaves the final convolutional block trainable) so the pretrained features are adapted to this dataset.
def create_vgg16_unetmodel(freeze_layers=16):
    print('Creating VGG16 model')
    model = VGG16(input_shape=(imgHeight, imgWidth, 3), include_top=False, weights="imagenet")
    layers = model.layers
    for i in range(freeze_layers):
        layers[i].trainable = False
    # encoder - get output layers to concatenate with the upsampling layers
    block1 = model.get_layer("block1_conv2").output
    block2 = model.get_layer("block2_conv2").output
    block3 = model.get_layer("block3_conv3").output
    block4 = model.get_layer("block4_conv3").output
    block5 = model.get_layer("block5_conv3").output
    # decoder blocks
    x = Conv2DTranspose(512, (2, 2), strides=(2, 2), padding='same')(block5)
    x = Concatenate()([x, block4])
    x = Conv2D(512, (3, 3), activation='relu', padding='same')(x)
    x = Conv2D(512, (3, 3), activation='relu', padding='same')(x)
    x = Conv2D(512, (3, 3), activation='relu', padding='same')(x)
    x = BatchNormalization()(x)
    x = Dropout(0.5)(x)
    x = Conv2DTranspose(256, (2, 2), strides=(2, 2), padding='same')(x)
    x = Concatenate()([x, block3])
    x = Conv2D(256, (3, 3), activation='relu', padding='same')(x)
    x = Conv2D(256, (3, 3), activation='relu', padding='same')(x)
    x = Conv2D(256, (3, 3), activation='relu', padding='same')(x)
    x = BatchNormalization()(x)
    x = Dropout(0.5)(x)
    x = Conv2DTranspose(128, (2, 2), strides=(2, 2), padding='same')(x)
    x = Concatenate()([x, block2])
    x = Conv2D(128, (3, 3), activation='relu', padding='same')(x)
    x = Conv2D(128, (3, 3), activation='relu', padding='same')(x)
    x = BatchNormalization()(x)
    x = Dropout(0.5)(x)
    x = Conv2DTranspose(64, (2, 2), strides=(2, 2), padding='same')(x)
    x = Concatenate()([x, block1])
    x = Conv2D(64, (3, 3), activation='relu', padding='same')(x)
    x = Conv2D(64, (3, 3), activation='relu', padding='same')(x)
    x = BatchNormalization()(x)
    x = Conv2D(1, kernel_size=1, activation="sigmoid")(x)
    vgg16unetmodel = Model(inputs=model.input, outputs=x)
    return vgg16unetmodel
model=create_vgg16_unetmodel()
model.summary()
Creating VGG16 model
Model: "model"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_1 (InputLayer) [(None, 224, 224, 3 0 []
)]
block1_conv1 (Conv2D) (None, 224, 224, 64 1792 ['input_1[0][0]']
)
block1_conv2 (Conv2D) (None, 224, 224, 64 36928 ['block1_conv1[0][0]']
)
block1_pool (MaxPooling2D) (None, 112, 112, 64 0 ['block1_conv2[0][0]']
)
block2_conv1 (Conv2D) (None, 112, 112, 12 73856 ['block1_pool[0][0]']
8)
block2_conv2 (Conv2D) (None, 112, 112, 12 147584 ['block2_conv1[0][0]']
8)
block2_pool (MaxPooling2D) (None, 56, 56, 128) 0 ['block2_conv2[0][0]']
block3_conv1 (Conv2D) (None, 56, 56, 256) 295168 ['block2_pool[0][0]']
block3_conv2 (Conv2D) (None, 56, 56, 256) 590080 ['block3_conv1[0][0]']
block3_conv3 (Conv2D) (None, 56, 56, 256) 590080 ['block3_conv2[0][0]']
block3_pool (MaxPooling2D) (None, 28, 28, 256) 0 ['block3_conv3[0][0]']
block4_conv1 (Conv2D) (None, 28, 28, 512) 1180160 ['block3_pool[0][0]']
block4_conv2 (Conv2D) (None, 28, 28, 512) 2359808 ['block4_conv1[0][0]']
block4_conv3 (Conv2D) (None, 28, 28, 512) 2359808 ['block4_conv2[0][0]']
block4_pool (MaxPooling2D) (None, 14, 14, 512) 0 ['block4_conv3[0][0]']
block5_conv1 (Conv2D) (None, 14, 14, 512) 2359808 ['block4_pool[0][0]']
block5_conv2 (Conv2D) (None, 14, 14, 512) 2359808 ['block5_conv1[0][0]']
block5_conv3 (Conv2D) (None, 14, 14, 512) 2359808 ['block5_conv2[0][0]']
conv2d_transpose (Conv2DTransp (None, 28, 28, 512) 1049088 ['block5_conv3[0][0]']
ose)
concatenate (Concatenate) (None, 28, 28, 1024 0 ['conv2d_transpose[0][0]',
) 'block4_conv3[0][0]']
conv2d (Conv2D) (None, 28, 28, 512) 4719104 ['concatenate[0][0]']
conv2d_1 (Conv2D) (None, 28, 28, 512) 2359808 ['conv2d[0][0]']
conv2d_2 (Conv2D) (None, 28, 28, 512) 2359808 ['conv2d_1[0][0]']
batch_normalization (BatchNorm (None, 28, 28, 512) 2048 ['conv2d_2[0][0]']
alization)
dropout (Dropout) (None, 28, 28, 512) 0 ['batch_normalization[0][0]']
conv2d_transpose_1 (Conv2DTran (None, 56, 56, 256) 524544 ['dropout[0][0]']
spose)
concatenate_1 (Concatenate) (None, 56, 56, 512) 0 ['conv2d_transpose_1[0][0]',
'block3_conv3[0][0]']
conv2d_3 (Conv2D) (None, 56, 56, 256) 1179904 ['concatenate_1[0][0]']
conv2d_4 (Conv2D) (None, 56, 56, 256) 590080 ['conv2d_3[0][0]']
conv2d_5 (Conv2D) (None, 56, 56, 256) 590080 ['conv2d_4[0][0]']
batch_normalization_1 (BatchNo (None, 56, 56, 256) 1024 ['conv2d_5[0][0]']
rmalization)
dropout_1 (Dropout) (None, 56, 56, 256) 0 ['batch_normalization_1[0][0]']
conv2d_transpose_2 (Conv2DTran (None, 112, 112, 12 131200 ['dropout_1[0][0]']
spose) 8)
concatenate_2 (Concatenate) (None, 112, 112, 25 0 ['conv2d_transpose_2[0][0]',
6) 'block2_conv2[0][0]']
conv2d_6 (Conv2D) (None, 112, 112, 12 295040 ['concatenate_2[0][0]']
8)
conv2d_7 (Conv2D) (None, 112, 112, 12 147584 ['conv2d_6[0][0]']
8)
batch_normalization_2 (BatchNo (None, 112, 112, 12 512 ['conv2d_7[0][0]']
rmalization) 8)
dropout_2 (Dropout) (None, 112, 112, 12 0 ['batch_normalization_2[0][0]']
8)
conv2d_transpose_3 (Conv2DTran (None, 224, 224, 64 32832 ['dropout_2[0][0]']
spose) )
concatenate_3 (Concatenate) (None, 224, 224, 12 0 ['conv2d_transpose_3[0][0]',
8) 'block1_conv2[0][0]']
conv2d_8 (Conv2D) (None, 224, 224, 64 73792 ['concatenate_3[0][0]']
)
conv2d_9 (Conv2D) (None, 224, 224, 64 36928 ['conv2d_8[0][0]']
)
batch_normalization_3 (BatchNo (None, 224, 224, 64 256 ['conv2d_9[0][0]']
rmalization) )
conv2d_10 (Conv2D) (None, 224, 224, 1) 65 ['batch_normalization_3[0][0]']
==================================================================================================
Total params: 28,808,385
Trainable params: 18,811,393
Non-trainable params: 9,996,992
__________________________________________________________________________________________________
def iou_loss(y_true, y_pred):
    y_true = tf.reshape(y_true, [-1])
    y_pred = tf.reshape(y_pred, [-1])
    # float() only works on scalar tensors; cast elementwise instead
    y_true = tf.cast(y_true, tf.float32)
    y_pred = tf.cast(y_pred, tf.float32)
    intersection = tf.reduce_sum(y_true * y_pred)
    score = (intersection + 1.) / (tf.reduce_sum(y_true) + tf.reduce_sum(y_pred) - intersection + 1.)
    return 1 - score

# combine bce loss and iou loss
def iou_bce_loss(y_true, y_pred):
    return 0.5 * keras.losses.binary_crossentropy(y_true, y_pred) + 0.5 * iou_loss(y_true, y_pred)

# mean iou as a metric
def mean_iou(y_true, y_pred):
    y_true = tf.cast(y_true, tf.float32)
    y_pred = tf.cast(tf.round(y_pred), tf.float32)
    intersect = tf.reduce_sum(y_true * y_pred, axis=[1])
    union = tf.reduce_sum(y_true, axis=[1]) + tf.reduce_sum(y_pred, axis=[1])
    smooth = tf.ones(tf.shape(intersect))
    return tf.reduce_mean((intersect + smooth) / (union - intersect + smooth))
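A NumPy restatement of the IoU loss (mirroring the TF version above, including the smoothing term of 1.0) makes the extremes easy to sanity-check: identical masks give a loss of 0, while disjoint masks approach 1 as the masks grow.

```python
import numpy as np

def iou_loss_np(y_true, y_pred, smooth=1.0):
    """NumPy mirror of the TF iou_loss above, for quick sanity checks."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    intersection = np.sum(y_true * y_pred)
    score = (intersection + smooth) / (np.sum(y_true) + np.sum(y_pred) - intersection + smooth)
    return 1.0 - score

perfect = iou_loss_np([1, 1, 0, 0], [1, 1, 0, 0])   # identical masks -> loss 0
disjoint = iou_loss_np([1, 0], [0, 1])              # no overlap -> loss 2/3 with smoothing
```

For the disjoint case, intersection = 0 and the smoothed score is 1/(1+1+1) = 1/3, so the loss is 2/3; with larger disjoint masks the smoothing term matters less and the loss tends to 1.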
model_path=rootDir+"model_unet_vgg16.h5"
checkpoint = ModelCheckpoint(model_path, monitor="val_loss", verbose=1, save_best_only=True)
stop = EarlyStopping(monitor="val_loss", patience=5)
reduce_lr = ReduceLROnPlateau(monitor="val_loss", factor=0.1, patience=3, min_lr=1e-6, verbose=1)
adam=Adam(learning_rate=0.01, beta_1=0.9,beta_2=0.99)
start = time.time()
try:
    model.compile(loss=iou_bce_loss, optimizer=adam, metrics=[mean_iou])
    history = model.fit(traingen,
                        validation_data=valgen,
                        epochs=10,
                        batch_size=batchSize,
                        callbacks=[stop, reduce_lr, checkpoint])
except Exception as e:
    print(e)
print('Time taken:', time.time() - start)
Epoch 1/10
32/32 [==============================] - ETA: 0s - loss: 0.6565 - mean_iou: 0.6922
Epoch 1: val_loss improved from inf to 0.73239, saving model to /content/drive/MyDrive/AIML/Projects/model_unet_vgg16.h5
32/32 [==============================] - 143s 4s/step - loss: 0.6565 - mean_iou: 0.6922 - val_loss: 0.7324 - val_mean_iou: 0.7781 - lr: 0.0100
Epoch 2/10
32/32 [==============================] - ETA: 0s - loss: 0.5702 - mean_iou: 0.8818
Epoch 2: val_loss did not improve from 0.73239
32/32 [==============================] - 91s 3s/step - loss: 0.5702 - mean_iou: 0.8818 - val_loss: 0.8759 - val_mean_iou: 0.7730 - lr: 0.0100
Epoch 3/10
32/32 [==============================] - ETA: 0s - loss: 0.5485 - mean_iou: 0.8271
Epoch 3: val_loss did not improve from 0.73239
32/32 [==============================] - 94s 3s/step - loss: 0.5485 - mean_iou: 0.8271 - val_loss: 0.8260 - val_mean_iou: 0.8313 - lr: 0.0100
Epoch 4/10
32/32 [==============================] - ETA: 0s - loss: 0.5337 - mean_iou: 0.7784
Epoch 4: val_loss improved from 0.73239 to 0.67759, saving model to /content/drive/MyDrive/AIML/Projects/model_unet_vgg16.h5
32/32 [==============================] - 86s 3s/step - loss: 0.5337 - mean_iou: 0.7784 - val_loss: 0.6776 - val_mean_iou: 0.8276 - lr: 0.0100
Epoch 5/10
32/32 [==============================] - ETA: 0s - loss: 0.5101 - mean_iou: 0.7909
Epoch 5: val_loss improved from 0.67759 to 0.64381, saving model to /content/drive/MyDrive/AIML/Projects/model_unet_vgg16.h5
32/32 [==============================] - 87s 3s/step - loss: 0.5101 - mean_iou: 0.7909 - val_loss: 0.6438 - val_mean_iou: 0.8305 - lr: 0.0100
Epoch 6/10
32/32 [==============================] - ETA: 0s - loss: 0.4892 - mean_iou: 0.8080
Epoch 6: val_loss improved from 0.64381 to 0.61332, saving model to /content/drive/MyDrive/AIML/Projects/model_unet_vgg16.h5
32/32 [==============================] - 85s 3s/step - loss: 0.4892 - mean_iou: 0.8080 - val_loss: 0.6133 - val_mean_iou: 0.8294 - lr: 0.0100
Epoch 7/10
32/32 [==============================] - ETA: 0s - loss: 0.4915 - mean_iou: 0.8194
Epoch 7: val_loss improved from 0.61332 to 0.54005, saving model to /content/drive/MyDrive/AIML/Projects/model_unet_vgg16.h5
32/32 [==============================] - 87s 3s/step - loss: 0.4915 - mean_iou: 0.8194 - val_loss: 0.5400 - val_mean_iou: 0.8176 - lr: 0.0100
Epoch 8/10
32/32 [==============================] - ETA: 0s - loss: 0.4801 - mean_iou: 0.8198
Epoch 8: val_loss improved from 0.54005 to 0.52060, saving model to /content/drive/MyDrive/AIML/Projects/model_unet_vgg16.h5
32/32 [==============================] - 85s 3s/step - loss: 0.4801 - mean_iou: 0.8198 - val_loss: 0.5206 - val_mean_iou: 0.8148 - lr: 0.0100
Epoch 9/10
32/32 [==============================] - ETA: 0s - loss: 0.4790 - mean_iou: 0.8320
Epoch 9: val_loss improved from 0.52060 to 0.51482, saving model to /content/drive/MyDrive/AIML/Projects/model_unet_vgg16.h5
32/32 [==============================] - 83s 3s/step - loss: 0.4790 - mean_iou: 0.8320 - val_loss: 0.5148 - val_mean_iou: 0.7987 - lr: 0.0100
Epoch 10/10
32/32 [==============================] - ETA: 0s - loss: 0.4780 - mean_iou: 0.8262
Epoch 10: val_loss improved from 0.51482 to 0.50707, saving model to /content/drive/MyDrive/AIML/Projects/model_unet_vgg16.h5
32/32 [==============================] - 85s 3s/step - loss: 0.4780 - mean_iou: 0.8262 - val_loss: 0.5071 - val_mean_iou: 0.7964 - lr: 0.0100
Time taken: 928.4293785095215
trainpred=model.evaluate(traingen)
32/32 [==============================] - 53s 2s/step - loss: 0.4897 - mean_iou: 0.8263
valpred=model.evaluate(valgen)
8/8 [==============================] - 13s 2s/step - loss: 0.4999 - mean_iou: 0.7987
plt.figure(figsize=(25,6))
plt.subplot(221)
plt.plot(history.epoch, history.history["loss"], label="Train loss")
plt.plot(history.epoch, history.history["val_loss"], label="Valid loss")
plt.legend()
plt.subplot(222)
plt.plot(history.epoch, history.history["mean_iou"], label="Train miou")
plt.plot(history.epoch, history.history["val_mean_iou"], label="Valid miou")
plt.legend()
plt.show()
def load_saved_model(modelname):
    if os.path.exists(modelname):
        return load_model(modelname, compile=False)  # use the argument, not the global model_path
    print('model does not exist at {}'.format(modelname))

model = load_saved_model(model_path)
def extract_bboxes(mask):
    lbl_0 = label(mask)
    bboxes = regionprops(lbl_0)
    img_boxes = []
    for bbox in bboxes:
        y1, x1, y2, x2 = bbox.bbox
        w, h = x2 - x1, y2 - y1
        # keep bounding boxes that span at least 20% of the image in each dimension,
        # eliminating small spurious regions
        if w > mask.shape[1] * 0.2 and h > mask.shape[0] * 0.2:
            img_boxes.append([x1, y1, w, h])
    return img_boxes
def calculate_bbox(bboxes, overlapThresh):
    bboxes = np.array(bboxes)
    pick = []
    x1 = bboxes[:, 0]
    y1 = bboxes[:, 1]
    w = bboxes[:, 2]
    h = bboxes[:, 3]
    w[w < 0] = 0
    h[h < 0] = 0
    x2 = x1 + w
    y2 = y1 + h
    area = (w + 1.0) * (h + 1.0)
    idxs = np.argsort(y2)
    while len(idxs) > 0:
        last = len(idxs) - 1
        i = idxs[last]
        pick.append(i)
        # find the largest (x, y) coordinates for the start of the bounding box
        # and the smallest (x, y) coordinates for the end of the bounding box
        xx1 = np.maximum(x1[i], x1[idxs[:last]])
        yy1 = np.maximum(y1[i], y1[idxs[:last]])
        xx2 = np.minimum(x2[i], x2[idxs[:last]])
        yy2 = np.minimum(y2[i], y2[idxs[:last]])
        w = np.maximum(0, xx2 - xx1 + 1)
        h = np.maximum(0, yy2 - yy1 + 1)
        overlap = (w * h) / area[idxs[:last]]
        # suppress boxes above the threshold parameter (previously hardcoded to 0.8)
        idxs = np.delete(idxs, np.concatenate(([last], np.where(overlap > overlapThresh)[0])))
    return list(bboxes[pick].astype("int"))
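A condensed restatement of this greedy suppression on toy boxes shows the behaviour: a box almost entirely covered by a picked box is dropped, while a distant box survives. This is a sketch for illustration (boxes are made up, threshold fixed at 0.8), not the notebook's exact function.

```python
import numpy as np

def nms_xywh(boxes, overlap_thresh=0.8):
    """Greedy suppression over (x, y, w, h) boxes, as in calculate_bbox."""
    boxes = np.asarray(boxes, dtype=float)
    x1, y1 = boxes[:, 0], boxes[:, 1]
    x2, y2 = x1 + boxes[:, 2], y1 + boxes[:, 3]
    area = (boxes[:, 2] + 1.0) * (boxes[:, 3] + 1.0)
    idxs = np.argsort(y2)
    pick = []
    while len(idxs) > 0:
        last = len(idxs) - 1
        i = idxs[last]
        pick.append(i)
        # overlap of the picked box with every remaining box
        w = np.maximum(0, np.minimum(x2[i], x2[idxs[:last]]) - np.maximum(x1[i], x1[idxs[:last]]) + 1)
        h = np.maximum(0, np.minimum(y2[i], y2[idxs[:last]]) - np.maximum(y1[i], y1[idxs[:last]]) + 1)
        overlap = (w * h) / area[idxs[:last]]
        idxs = np.delete(idxs, np.concatenate(([last], np.where(overlap > overlap_thresh)[0])))
    return boxes[pick].astype(int).tolist()

# box 2 sits far away; box 1 covers ~83% of box 0, so box 0 is suppressed
kept = nms_xywh([[0, 0, 10, 10], [1, 1, 10, 10], [50, 50, 5, 5]])
```

Note the overlap is measured against the area of the *other* box, not the union, so this is a containment-style criterion rather than strict IoU.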
def non_maxima_suppression(predmasks, maskTh=0.5, overlapThresh=0.5):
    predABBoxes, predBBoxes = [], []
    for i in range(len(predmasks)):
        bboxes = extract_bboxes(predmasks[i][:, :] > maskTh)
        if len(bboxes) == 0:
            predABBoxes.append([])
        else:
            predABBoxes.append(calculate_bbox(bboxes, overlapThresh))
    for i in range(len(predABBoxes)):
        if len(predABBoxes[i]) == 0:
            predBBoxes.append(predABBoxes[i])
        else:
            predBBoxes.append([list(y) for y in predABBoxes[i]])
    return predBBoxes
predgen = CustomDataGen(valpartial,
                        x_col=xcol,
                        y_col=ycol,
                        batch_size=batchSize, input_size=imgSize,
                        preprocess_function=preprocess_input,
                        include_labels=True)
pred = []
true = []
imgs_path = []
imgs = []
labels = []
for images, images_path, lbls, truemasks in tqdm_notebook(predgen):
    imgs_path.extend(images_path)
    imgs.extend(images)
    labels.extend(lbls)
    true.extend(truemasks)
    pred_mask = model.predict(x=[images])
    pred.extend(pred_mask)
# upsample the predicted masks back to the original 1024x1024 resolution before box extraction
pred = tf.image.resize(pred, (1024, 1024))
pred = non_maxima_suppression(np.asarray(pred).squeeze())
2/2 [==============================] - 6s 15ms/step
2/2 [==============================] - 0s 419ms/step
2/2 [==============================] - 1s 419ms/step
2/2 [==============================] - 0s 368ms/step
2/2 [==============================] - 0s 363ms/step
2/2 [==============================] - 0s 368ms/step
2/2 [==============================] - 0s 371ms/step
2/2 [==============================] - 5s 5s/step
predDF=pd.DataFrame({'patientId':imgs_path, 'predbboxes':pred})
predDF['predlbls']=predDF['predbboxes'].apply(lambda x:1 if len(x)>0 else 0)
predDF.sort_values(by='patientId', ignore_index=True, inplace=True)
combinedDF=pd.merge(valpartial, predDF)
def loadImage(idx, df, axis):
    img_file = df.iloc[idx]['patientId']
    image = dicom.dcmread(img_file).pixel_array
    axis.imshow(image, cmap=plt.cm.bone)
    # ground-truth boxes in red
    bboxes = df.iloc[idx]['bboxes']
    for bbox in bboxes:
        x, y, w, h = bbox
        rect1 = patches.Rectangle((x, y), w, h, linewidth=2, edgecolor='red', fill=False)
        axis.add_patch(rect1)
    # predicted boxes in green
    bboxes = df.iloc[idx]['predbboxes']
    for bbox in bboxes:
        x, y, w, h = bbox
        rect2 = patches.Rectangle((x, y), w, h, linewidth=2, edgecolor='green', fill=False)
        axis.add_patch(rect2)
    axis.set_title('Actual:Predicted\n' + labelDict.get(df.iloc[idx]['label']) +
                   ':' + labelDict.get(1 if len(df.iloc[idx]['predbboxes']) > 0 else 0))

def display_actual_preds(df):
    cols = 3
    rows = 4
    f, axarr = plt.subplots(rows, cols, figsize=(18, 10))
    for r in range(rows):
        for c in range(cols):
            axis = axarr[r, c]
            loadImage(np.random.randint(df.shape[0]), df, axis)  # sample from df, not the global valpartial
    plt.tight_layout()
display_actual_preds(combinedDF)
prec, rec, f1s, _ = prf(combinedDF['label'], combinedDF['predlbls'], average='binary')
print('The precision, recall and f1-score for the classification of the pneumonia data set are \
{}, {} and {} respectively'.format(round(prec, 3), round(rec, 3), round(f1s, 3)))
The precision, recall and f1-score for the classification of the pneumonia data set are 0.835, 0.724 and 0.775 respectively
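The numbers above come from sklearn's `precision_recall_fscore_support` (imported here as `prf`). The underlying definitions, spelled out in plain Python for reference:

```python
def binary_prf(y_true, y_pred):
    """Precision, recall and F1 for binary labels, from first principles."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# toy labels, not the model's output: 2 true positives, 1 false positive, 1 false negative
p, r, f = binary_prf([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

For this model, precision 0.835 with recall 0.724 means relatively few false alarms but roughly a quarter of pneumonia cases missed at the current mask threshold.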